Benchmarking domain-specific compiler 
optimizations for variational forms 



Robert C. Kirby 
Texas Tech University 
and 

Anders Logg 

Center for Biomedical Computing, Simula Research Laboratory 
Department of Informatics, University of Oslo 



We examine the effect of using complexity-reducing relations [Kirby et al. 2006] to generate op- 
timized code for the evaluation of finite element variational forms. The optimizations arc imple- 
mented in a prototype code named FErari, which has been integrated as an optimizing backend 
to the FEniCS Form Compiler, FFC [Kirby and Logg 2006; 2007]. In some cases, FErari provides 
very little speedup, while in other cases, we obtain reduced local operation counts of a factor of 
as much as 7.9 and speedups for the assembly of the global sparse matrix of as much as a factor 
of 2.8 (see Figure 9). 
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1. INTRODUCTION 

Projects such as the FEniCS Form Compiler (hence, FFC) [Kirby and Logg 2006; 
2007; Logg 2007], Sundance [Long 2003; 2004; 2006], and deal.II [Bangerth et al. 
2006] aim to automate important aspects of finite element computation. In the case 
of FFC, low- level code is generated for the evaluation of element stiffness matrices or 
their actions, together with the local-global mapping. The existence of such a com- 
piler for variational forms naturally leads one to consider an optimizing compiler for 
variational forms. What mathematical structure in the element-level computations 
is tedious for humans to exploit by hand, but possible for a computer to find? We 
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have provided partial answers to this question in a series of papers [Kirby ct al. 
2005; Kirby et al. 2006; Kirby and Scott 2007]. These ideas have been implemented 
in a prototype code called FErari, and we provide an empirical study of the opti- 
mizations implemented by FErari in this paper. Both FFC and FErari are part of 
the FEniCS project; for more information about the software, we refer readers to 
the project web page [Logg et al. 2007]. 

FFC takes as input a multilinear variational form and generates code for evalu- 
ating that form over affine elements. The formation of the local stiffness matrix on 
a single element is expressed as a linear transformation (known at compilc-timc) 
applied to a vector representing the geometry and coefficient data (known only at 
run-time). The linear transformation depends on the variational form and finite 
element basis, but not on the mesh. This means that the cost of generating and 
optimizing the code is independent of the size of mesh, but depends strongly on 
the complexity of the variational form and polynomial degree used. The generated 
code is completely unrolled. This internal kernel is then called for each of the many 
elements of the mesh at run-time to compute the global sparse matrix. FFC also 
supports a mode that calls level 2 BLAS [Dongarra et al. 1988] rather than gen- 
crating unrolled code. This typically gives comparable run-time performance and 
smaller cxccutablcs. However, the optimizations we consider here are only possible 
to apply in the context of unrolled code. 

To a user of FFC, the optimizations are invoked simply with a -0 flag, which 
turns on a call to FErari and thence a modified code generator. It is important 
to note that the optimizations considered are similar to, but typically beyond the 
abilities of general-purpose compilers to detect. In assessing the efficacy of these 
techniques at reducing run-time, we focus on the construction of the sparse matrix 
and its matrix-free application for a variety of variational forms. In particular, we 
study the "pure" effect of the FErari optimizations as well as the optimizations 
relative to the cost of inserting into a sparse matrix data structure. 

While several fairly theoretical papers [Kirby et al. 2005; Kirby ct al. 2006; 
Kirby and Scott 2007] have shown that reductions in arithmetic cost are possible 
to obtain, there are only very limited tests of the practical impact of the proposed 
optimizations. With some notable exceptions, such as reported in Figure 9 below, 
the optimizations provide somewhat disappointing empirical results, such as only 
a few percent speedup. However, it is still important to include these tests in the 
literature to bring some completeness to the theoretical work. In many cases, the 
poor speedups are due to local computation (what we optimize) being dominated 
by the cost of insertion into global sparse data structures. As memory access is 
typically very slow compared to floating point arithmetic, this may not be surpris- 
ing. However, it is possible that the optimizations considered here could perform 
better in practice in other situations with lower memory traffic, such as element-by- 
element or static condensation techniques. That said, one does obtain significant 
global speedups in some cases. For the set of test cases examined below, we obtain 
a factor of 2.8 global speedup for the assembly of the global sparse matrix of the 
weighted advection operator for quartics on tetrahedra (Figure 9). 
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2. FINITE ELEMENT ASSEMBLY AND THE ELEMENT TENSOR 

In finite elements, the nonlinear and linear algebraic problems come from evaluating 
the variational forms on the finite element basis functions. In our work on FFC 
and FErari, we have focused on evaluating multilinear forms over affine elements, 
and wc continue to do so here. 

The typical example is the bilinear form for Poisson's equation, 

a(v,u) = / Vv-Vudx. (1) 
Jn 

If {4>j}jLi is a finite element basis defined on some triangulation T of the domain 
f2, the global stiffness matrix is 

Ai = aO*!,^), (2) 

where i = i%) is a multiindcx. 

The standard algorithm [Zicnkicwicz et al. 1967; Hughes 1987; Langtangen 1999] 
for computing the matrix A is known as assembly; it is computed by iterating over 
the cells of the mesh T and adding from each cell the local contribution to the 
global sparse matrix A. A similar process can compute a global action, in which A 
is applied to some vector u without explicitly forming A. 

The integral defining a multilinear form a may be written as a sum of integrals 
over the cells K of a triangulation 7~~ of the domain f2: 

a= ^2 a K , (3) 

and thus 

M = a K (<j} h ,(P i2 ). (4) 

For Poisson's equation, the element bilinear form clk is thus given by o,k{v,u) = 
J K X7v ■ Vu dx. Finite element bases are constructed so that each ax is zero except 
for a few basis functions. 

For affine elements, as wc consider here, the shape functions are constructed 
once on a reference element Kq and mapped to each element of the mesh via an 
affine mapping Fx- In doing so, one must construct a "local-global mapping" that 
relates an ordering of the element shape functions to the global basis functions. The 
contribution of element K to the global matrix A is then evaluated in two stages. 
First, a dense element matrix is computed by evaluating ax on the shape functions 
for K. We call this element matrix A . Then, each entry of A K is summed into 
the appropriate location in the global sparse matrix as defined by the local-global 
mapping. The first stage is dominated by floating point computation, the second 
requires more substantial memory access. 

Our work in [Kirby and Logg 2006; 2007] has focused on a general paradigm 
for efficiently constructing A K . It has long been known that precomputing certain 
integrals on the reference element can speed up computation of the element tensor, 
especially for bilinear forms with straight-sided elements. A general approach to 
precomputing certain integrals was first introduced in [Kirby ct al. 2004; Kirby 
et al. 2005] and later formalized and automated in [Kirby and Logg 2006; 2007]. 
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A similar approach was implemented in early versions of DOLFIN [Hoffman and 
Logg 2002; Hoffman et al. 2006a; 2006b], but only for piecewise linear elements. 




Xi = (0,0) X 2 = (1,0) 

Fig. 1. The (affine) mapping Fk from a reference cell Kq to some cell K £ T. 



As an example, we consider here the computation of the element matrix A K for 
the Laplacian. When the mapping Fk from the reference cell is affine (Figure 1), 
we have for the Laplacian 



Jk Jk oxp dxp 



whence a change of variables yields 



(5) 



(6) 



aeA 



where A and Ik are sets of allowed multiindiccs (depending on the spatial dimen- 
sion and the discrctizing polynomial spaces). More simply, we can write 



where 



A®^ — 



A K = A : G K , 



' k 9X ai dX a2 
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We refer to the tensor A as the reference tensor and to the tensor Gk as the 
geometry tensor. For more details and extensions of this notation to a wide class of 
multilinear forms, we refer the reader to our previous work [Kirby and Logg 2006; 
2007]. 

In [Kirby ct al. 2005; Kirby ct al. 2006; Kirby and Scott 2007], we have explored 
special mathematical structure that leads to reduced operation counts. However, 
it was studied only in a limited case what the net impact of FErari optimizations 
when the cost of global assembly is counted as well. 

3. A FRAMEWORK FOR OPTIMIZATION 

In this section, we present an overview of our framework for optimization of vari- 
ational form evaluation. Two different approaches are presented. The first is a 
coarse-grained strategy based on phrasing the tensor contraction (7) as a matrix- 
vector or matrix-matrix multiplication that may be computed by an optimized 
library call. The second, which is what FErari implements, exploits the structure 
of the tensor contraction to find an optimized computation with a reduced operation 
count. 

3.1 Tensor contraction as a matrix-vector product 

To evaluate the clement tensor A , one must evaluate the tensor contraction (7). A 
simple approach would be to iterate over the entries {Af} ie x K °f A K and for each 
entry Af compute the value of the entry by summing over the set of indices A. 
However, by an appropriate reshaping of the tensors A K , A and Gk, one may 
phrase the tensor contraction as a matrix-vector product and call an optimized 
library routine for the computation of the matrix-vector product, such as the level 
2 BLAS routine DGEMV. We write matrix- vector product as as a K = A°gK, where 
a K and gx are A K and Gk reshaped into vectors and A is A reshaped into a 
matrix. 

Of course, once the computation of one a K may be computed as a matrix- vector 
product, the computation of {a Ki }fL l for some M elements of the mesh can nat- 
urally be encoded as a matrix-matrix multiplication. Using DGEMM in such a 
context is an example of coarse-grained optimization, making good use of cache 
in a large computation. Such an approach necessarily overlooks problem-specific 
optimizations such as we find in FErari, but may be very effective in many cir- 
cumstances. It is to be expected that which approach is preferable will depend 
strongly on how much structure FErari finds and how well the resulting algorithms 
are mapped onto hardware, as well as whether the computation is large enough for 
DGEMM to have good performance. We do not explore the coarse-grained strategy 
further in this paper. 

3.2 Complexity-reducing relations 

The matrix A is computed at compile-time by FFC, and it typically possesses sig- 
nificant structure that can be exploited to reduce the amount of arithmetic needed 
to multiply it by a vector gK at run-time. It is also helpful to think of the product 
A°gK as a collection of vector dot products, where vectors a° are the rows of A . 

As an example, we consider forming the weak Laplacian on triangles using 
quadratic Lagrange basis functions. A is shown in Table I. We have displayed 
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the index into the unflattcned A in the first column, and the rest of row i is the 
flattened vector a®. So, the process of forming A K for some triangle K is first to 
compute the geometry vector gx and then to form the matrix- vector product A gx ■ 
In this case, we will obtain a vector a K of length 36, which will be reshaped to the 
6x6 element tensor A K . This is then inserted into the global stiffness matrix via 
the local-global mapping. 
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Table I. The flattened reference tensor for quadratic Lagrange elements on triangles. The first 
column gives the index of the clement tensor to which the row corresponds, and the rest of the 
columns in the row are the entries of the flattened vector. 

To optimize the evaluation of the element tensor, we look for dependencies be- 
tween the vectors {a®}i<z.x K . or equivalently the rows of A that can be used to 
reduce the cost of forming the matrix-vector product. We may only look for struc- 
ture in {a°}igXjci as the gx vectors are only known at run-time. For example, if 
two vectors a° and a° are collinear (such as the rows (1,0) and (1,5) in Table I), 
then a° • gx may be computed using a° • gx in only one multiply, and vice versa. If 
the Hamming distance (number of different entries between a and a ) is k, then 
the result a° • gx can be computed from a° • gx in about k multiply-add pairs, and 
vice versa. These kinds of relations are called "complexity-reducing relations" , and 
they arc related to common subexpressions. Note that using such a relationship 
requires that the code for the dot products be unrolled. As with FFC, there may 
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come a point at which code bloat outweighs gains in arithmetic cost, but we remark 
that code optimized by FErari contains fewer arithmetic operations and hence is 
smaller than the standard FFC output, but much larger than using the BLAS mode 
of FFC. 

In [Kirby ct al. 2006], we constructed a weighted, undirected graph, the vertices 
of which were the vectors a® and the weights of whose edges were the pairwisc 
distances under a complexity-reducing relation (the cost of computing one entry in 
the element matrix from another) . We proved that a minimum spanning tree of this 
graph encodes a minimal-arithmetic (in a specific sense) algorithm for evaluating 
the product of ^4° with an arbitrary input vector. 

In Figure 2, we show the dependency graph generated by FErari. The arrows 
indicate dependency rather than implication. That is, the arrow from (0,0) to (1,1) 
indicates that the result of computing y\9K is used to compute a° ^gK- Hence, 
the implied flow of computation is from right to left, and disconnected components 
in the graph are independent of each other. 

As one extension of this technique, we notice that many of the vectors may be 
computed effectively by ignoring multiplication by zero. For example, entry (1,3) 
in Table I only has one nonzero entry. It makes sense to generate code for forming 
a (i 3 )<?if explicitly instead of using a complexity-reducing relation. In this case, we 
have "snipped" the edge from the entry (1,3) to its parent in the minimum spanning 
tree before generating code and thus this entry has no outgoing arrows. Hence, we 
properly have a forest rather than a tree. 

Many other kinds of structure may be found in ^4°. For example, in many cases 
one can prove that the tensor has symmetries along certain axes. We used this, 
for example, in [Kirby et al. 2005; Kirby et al. 2006], but have yet not automated 
the detection of such structure. Also, frequently three or more rows of A will be 
linearly dependent. A first attempt at exploiting this structure is found in [Kirby 
and Scott 2007], but our present work is limited to complexity-reducing relations. 

4. BENCHMARK RESULTS 

For a range of forms and polynomial degrees, we report several quantities for form- 
ing the matrix and its action. First, we report the base operation count \Ik \ \A\ 
for forming the element tensor A , as well as the operation counts generated by 
FFC 1 and the FErari optimizations. Having generated code for the local element 
computation from both FFC and FErari, we compare the run-time for these codes 
being executed several times. This measures the efficacy of FErari at exactly the 
point it seeks to optimize. Then, to provide a broader context, we present the 
speedup obtained in the global assembly process, when the overhead of sparse data 
structures is included. 

In each case, we generated code for the local and global computation both with 
and without FErari optimizations. This code was compiled and run on an IBM 
Thinkpad T60p with 2GB of RAM and a dual core Intel T2600 chip running at 
2.16 GHz. The operating system was Ubuntu Linux with kernel 2.6.17-10-386. The 
compiler was g++ version 4.1.2 using optimization flag -02 on all variational forms 



1 FFC reduces the base operation count by omitting computation of zeros when the element tensor 
is sparse. 
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Fig. 2. Dependency graph for forming the element stiffness matrix for the Laplacian using 
quadratic Lagrange triangles as determined by FErari. 
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except the weighted Laplacian operator and action using quartics in 3D. The com- 
piler and machine could only handle optimization mode -00 in these cases. This 
illustrates a challenge with our approach to finite element code generation based on 
the tensor representation (7). Since straight- line code is generated for the compu- 
tation of the element tensor, complicated forms or high-dimensional finite element 
spaces may lead to generation of large amounts of code which the C++ compiler 
is not able to handle, particularly in optimized mode. For these forms, generating 
code based on quadrature rather than tensor contraction with FFC/FErari could 
be more practical. 

For two-dimensional problems, we used a regular triangulation based on subdi- 
viding a 64 x 64 square mesh into right triangles, resulting in a total of 4,225 vertices 
and 8,192 triangles. For three dimensions, we used a 16 x 16 x 16 partition of the 
unit cube into 4,913 vertices and 24,576 tetrahedra. The timing was performed 
adaptively to ensure that at least one second of CPU time elapsed for a set of at 
least ten repetitions for each test case. For the sparse matrix data structure, a 
simple stcl: : vector<std: :map<unsigned int , double> > was used, which was 
found competitive with insertion into a sparse PETSc matrix. 

In most cases, we find decent speedup in the operation count, although it does not 
always translate into a speedup in the runtime for the local computation. FErari 
is currently architecture-unaware. Rearranging the matrix-vector computation in a 
way that makes poor use of registers, for example, can more than offset reductions 
in the actual amount of arithmetic. A better result would be obtained by somehow 
combining the graph-based optimizations with an architecture model, or using a 
special-purpose compiler such as Spiral [Piischcl et al. 2005]. 

Moreover, even a speedup in local computation does not always improve the 
global cost of assembling a matrix or vector. If a relatively small amount of work 
is required to compute A , then the cost of assembling it into the global matrix 
or vector may dominate; reductions in arithmetic are not significant. On the other 
hand, when the construction of A K is relatively expensive, then speedup in the 
construction of the global matrix or vector can be realized by reduction of arithmetic 
in the local computation. In our empirical results, we observe a tendency of FErari 
to provide better global speedups for more complicated variational forms. 

4.1 Laplacian 

First, we consider the Laplacian, with the variational form 



We use Lagrange polynomials of degree k — 1,2,. ..,5 on triangles and degree 
k = 1, 2, . . . , 4 on tetrahedra. 2 

In each case, FErari provides up to about a factor of three improvement in 
operation count. The reduction in operation count, local computation time, and 
global computation time required is plotted in Figure 3. The reduction in arithmetic 
reduces the run-time to evaluate the local stiffness matrix (multiplying by §k) by 



2 The polynomial degree on tetrahedra was limited by available resources to compute the opti- 
mization. 




(9) 
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Fig. 3. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the Laplacian (9). 



a factor of 1.5 to 2 in both two and three dimensions. However, the reduction does 
not have a major impact on the global time to assemble the matrix. In this case, 
there are very few arithmetic operations needed to construct the local matrix, and 
the cost of inserting into the global matrix overshadows the gains FErari provides. 

We also consider the matrix action as needed in a Krylov solver. Assembling 
into a global vector is less expensive than into a global matrix, and we see better 
speedups in evaluating the action of the Laplacian operator. In this case, FFC 
and FErari generate code for evaluating (9) with u a member of the finite element 
space. Speedup of this operation is felt at each iteration of a Krylov method and so 
translates directly into decreased solve time. The matrix A has the same entries 
as for forming the stiffness matrix, but has a different shape. In this case, the shape 
is |Pfc| x (d 2 |Pfc|). Note that FErari does not do as well for the action as for forming 
the matrix. Although the entries of A are the same as before, the difference in 
shapes complicates finding collinear relationships. When the rows have only d 2 (4 
or 9) entries for the stiffness matrix, more collinearity is found than when there 
are |Pfc| times as many entries. However, finding Hamming distance relations is as 
effective as before. Despite the smaller reduction in operation count, the effect of 
the optimizations on run-time is much greater than in forming the matrix, as we can 
see by comparing Figure 4 to Figure 3. A global speedup of about 10% is observed 
for degrees three through five in two dimensions, and a speedup of 20%-40% for 
quadratics through quartics in three dimensions. Again, only a small improvement 
is observed for low order methods. 
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Fig. 4. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the action of the Laplacian (9). 

4.2 Weighted Laplacian 

Now, we consider the form 



for a fixed weight w where we assume that v, u, w all come from the same Lagrange 
finite element space. In this case, the presence of the coefficient w makes the 
local form more expensive to evaluate. The matrix Aq now has \Pk\ 2 rows and 
d 2 \Pk \ columns. However, the graph of the global matrix for this form is the same 
as for the constant coefficient case, assuming the same basis and mesh are used. 
Consequently, the cost of assembly is exactly the same once A K is constructed. 

Again, FErari reduces the operation count and run-time for the local computation 
considerably. Given that the arithmetic cost is much larger than for the constant- 
coefficient case, it is not surprising that the global speedups are much better, as 
seen in Figure 5. 

As before, A has the same entries but a different shape when the action of the 
form is considered. Now, the shape is \Pk\ x [d 2 \Pk\ 2 )- While FErari does not 
reduce the operation count for the matrix action as significantly as it does for the 
matrix itself, the global speedups are more significant (Figure 6). 

4.3 Advection 

Next, we consider the advection operator 




(10) 




(11) 



ACM Transactions on Mathematical Software, Vol. V, No. N, Month 20YY. 



12 • R. C. Kirby and A. Logg 




Fig. 5. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the weighted Laplacian (10). 



Weighted Laplacian action 




Polynomial degi 



Fig. 6. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the action of the weighted Laplacian (10). 
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Fig. 7. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the advection operator (11). 

where (3 is some constant vector and consider forming the global stiffness matrix 
and its action. For the matrix, the dimension of A is \Pk\ 2 xd 3 . The advection j3 is 
defined as a piecewise constant vector- valued Lagrange function which has d degrees 
of freedom on each element. As a result, the matrix A is physically of dimension 
|Pfe| 2 x d 3 , but the number of nonzero elements scales like \Pk\ 2 x d 2 . This is because 
the reference tensor A generating the matrix A is formed as an outer product with 
$ai[o:2] = ^02! that is, component ai of the piecewise constant vector- valued 
basis function $ Ql . Precontracting the reference tensor along dimensions a\,a.2 
would thus reduce the size of the matrix A to \Pk\ 2 x d 2 . Low-order elements 
like piecewise constants and linears often generate particular structures that can be 
used for further optimizations. Such optimizations are not handled by FErari and 
are an interesting venue for further research. 

As with forming the Laplacian, the reduced operation counts do not significantly 
affect the global runtime (Figure 7). The operation counts and speedups for the 
matrix action are found in Figure 8. Global speedup is again most significant for 
higher order elements in three dimensions. 

4.4 Weighted advection in a coordinate direction 

Finally, we consider the advection operator oriented along a coordinate axis, but 
with the velocity field varying in space (projected into the finite element space): 



We consider forming the matrix and its action for a fixed weight w. This oper- 
ator is a portion of the trilinear momentum advection term in the Navier-Stokes 




(12) 
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Fig. 8. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the action of the advection operator (11). 

equations. For constructing the matrix, we observe a nice speedup in local com- 
putation, although in two dimensions this has only a marginal effect on the global 
run-time for assembly. However, we gain significantly for higher-order elements in 
three dimensions, where we see a global speedup with 180% (a factor 2.8) for quar- 
tics. The operation counts for the local matrix construction and action are shown 
in Figures 9 and 10. 

4.5 Speedup versus work 

As we noted before, reducing floating-point arithmetic is expected to be more sig- 
nificant to the global computation when the individual entries in the local matrix 
or vector are already expensive to compute. As a test of this, we plot the speedup 
of FErari over FFC against the number of columns in each reference operator A in 
Figure 11. We do this for all orders and forms, considering matrices and their ac- 
tions separately. Although it is not an exact relation (as to be expected), Figure 11 
does indicate a general trend of speedup increasing with the base cost of work per 
entry. 

4.6 Compile times 

It is important to quantify the additional compilc-time cost of using FErari within 
FFC. In some situations, especially in a just-in-time compilation, the significant 
additional cost will outweigh the potential run-time gains. In this section, we 
report compile times for a few forms as an example. It should be remembered, 
however, that FErari is currently implemented in Python and far from tuned for 
performance. A better implementation should improve these compile times. 

Tables II and III give the compile times for FFC without and with FErari op- 
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Fig. 9. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the weighted advection operator (12). 




Fig. 10. Speedup in operation count, local run-time and global run-time for using FErari versus 
FFC only for the action of the weighted advection operator (12). 
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Fig. 11. The global speedup that FErari produces over FFC is plotted against the number of 
columns in the associated reference matrix A , which is a measure of the work required to compute 
each entry of A K . 



Form 


Degree 


FFC 


GCC 


GCC -02 


Laplacian operator 


1 


0.016 


2.3 


2.3 


Laplacian operator 


2 


0.035 


2.2 


2.5 


Laplacian operator 


3 


0.13 


2.5 


3.7 


Weighted Laplacian operator 


1 


0.029 


2.2 


2.4 


Weighted Laplacian operator 


2 


0.26 


2.8 


5.2 


Weighted Laplacian operator 


3 


2.3 


9.1 


130 



Table II. Compile times in seconds for FFC, GCC and GCC with optimization -02 for a set of 
forms. 



timizations respectively. We also report the time for compiling the CH — h code 
generated by FFC with GCC (g++). We note a few interesting details from these 
numbers. First, we note that the FErari optimizations may take considerable time, 
in particular for high degree polynomials and forms containing coefficients. Fur- 
ther, we note that it may also take considerable time to compile the generated 
code. Finally, we note that GCC may in some cases run faster if the generated 
code has already been optimized by FErari. This gain is small compared to the 
cost of running FErari, and is directly attributable to the resulting unrolled code 
having fewer operations. 

5. CONCLUSIONS 

Several things emerge from our empirical study of optimizing FFC with FErari. 
In certain contexts, FErari can provide tens of percent to a few times speedup in 
runtime in forming or applying stiffness matrices. Moreover, these cases tend to 
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Form 


Degree 


FFC -O 


GCC 


GCC -02 


Laplacian operator 


1 


0.12 


2.1 


2.3 


Laplacian operator 


2 


4 


2.2 


2.5 


Laplacian operator 


3 


68 


2.4 


3.3 


Weighted Laplacian operator 


1 


0.23 


2.2 


2.4 


Weighted Laplacian operator 


2 


22 


2.6 


4.5 


Weighted Laplacian operator 


3 


760 


7.2 


78 



Table III. Compile times in seconds for FErari-optimizcd FFC, GCC and GCC with optimization 
-02 for a set of forms. 

be the computationally harder ones (three dimensions, higher order polynomials). 
However, FErari is not without its costs. It dramatically adds to the compile-time 
for FFC, and when used for simple forms can actually hinder runtime. 

Besides improving the run-time performance of finite element codes generated by 
FFC and FErari, our results shed some light on where FErari could be improved 
and in how a fully functional optimizing compiler for finite elements might be de- 
veloped. First, our calculations did little to optimally order the degrees of freedom; 
better ordering algorithms should decrease the cost of insertion. Second, algorithms 
trying to maximize performance must have some awareness of the underlying com- 
puter architecture. The success of Spiral in signal processing suggests this should 
be possible. Moreover, knowing when to do what kinds of optimization, such as 
FErari's fine-grained optimization versus a coarse-grained level 3 BLAS approach, 
must be determined. This must also be compared against when quadrature-based 
algorithms might be effective, as well as whether the stiffness matrix should be 
explicitly constructed, statically condensed, or applied without being constructed. 
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